ABSTRACT

Inaccessible web pages are part of the browsing experience.
The content of these pages however is often not completely
lost but rather missing. Lexical signatures (LS) generated
from the web pages' textual content have been shown to be
suitable as search engine queries when trying to discover a
(missing) web page. Since LSs are expensive to generate,
we investigate the potential of web pages' titles as they are
available at a lower cost. We present the results from
studying the change of titles over time. We take titles from
copies provided by the Internet Archive of randomly sampled
web pages and show the frequency of change as well as the
degree of change in terms of the Levenshtein score. We found
very low frequencies of change and high Levenshtein scores
indicating that titles, on average, change little from their
original, first observed values (rooted comparison) and even
less from the values of their previous observation (sliding).

1. INTRODUCTION 

Inaccessible web pages and "404 Page Not Found" responses
are part of the web browsing experience. Despite guidance
on how to create "Cool URIs" that do not change [2], there
are many reasons why URIs or even entire websites break
[16]. However, we claim that information on the web is
rarely completely lost; it is just missing. In whole or in
part, content is often just moving from one URL to another.
It is our intuition that major search engines like Google,
Yahoo and MSN Live, as members of what we call the Web
Infrastructure (WI), have likely crawled the content and
possibly even stored a copy in their cache. Therefore the
content is not lost, it "just" needs to be rediscovered.
The WI, explored in detail in [21 17 TT], also includes
(besides search engines) non-profit archives such as the
Internet Archive (IA) or the European Archive as well as
large-scale academic digital data preservation projects,
e.g., CiteSeer and NSDL.

It is commonplace for content to "move" to different URIs
over time. Figure 1 shows two snapshots as an example of a
web page whose content has moved within a two-year period.
Figure 1(a) shows the content of the original URL of the
Hypertext 2006 conference¹ as displayed in 1/2009. The
original URL clearly does not hold conference-related
content anymore. Our suspicion is that the website
administrators did not renew the domain registration and
therefore someone else took over. However, the content is
not lost. Figure 1(b) shows the content, which is now
available at a new URL². This example describes the
retrieval problem we are addressing with our research.

(a) Original URL, new (unrelated) Content
(b) Original Content, new URL

Figure 1: The Content of the Website for the Conference
Hypertext 2006 has Moved over Time

In Figure 2 we display our scenario for discovering web
pages that are considered missing. The occurrence of a 404
error is displayed in the first step. Note that a page
returning unrelated content (such as in the example above)
can be considered missing as well, since the user intends to
retrieve the original content.
Search engine caches and the IA will consequently be queried 
with the URL requested by the user. In case older copies of 
the page are available they can be offered to the user. If the 
user's information need is satisfied, nothing further needs to 
be done (step (2)). If this is not the case we need to proceed 
to step (3) where we extract titles, try to obtain tags about 
the URL and generate LSs from the obtained copies. They 



¹ http://www.ht06.org/
² http://hypertext.expositus.com/



are then queried against live search engines and the
returned results are again offered to the user, as depicted
in step (4) of Figure 2. In case the user is again not
satisfied with the outcome, more sophisticated and complex
methods need to be applied (step (5)). For example, search
engines can be queried to discover pages linking to the
missing page. The assumption is that the aggregate of those
pages is likely to be about the same topic. From this link
neighborhood an LS can be generated. At this point the
approach is the same as the LS method, with the exception
that the LS has been generated from a link neighborhood and
not from a cached copy of the page itself. This scenario
also needs to be applied in case no copies of the missing
page can be found in search engine caches and the IA. The
final results are provided in step (6). The important point
of this scenario is that it works while the user is browsing
and therefore has to provide results in real time.
Figure 2: Process to Rediscover Missing Web Pages
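
The six steps of this scenario can be sketched as a simple fallback chain. The following Python sketch is only an illustration under stated assumptions: the function name and all callables passed in (fetch_copies, is_satisfied, queries_from_copies, queries_from_neighborhood, search) are hypothetical placeholders and do not correspond to any existing system or API.

```python
from typing import Callable, List


def rediscover(
    url: str,
    fetch_copies: Callable[[str], List[str]],
    is_satisfied: Callable[[List[str]], bool],
    queries_from_copies: Callable[[List[str]], List[str]],
    queries_from_neighborhood: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Hypothetical fallback chain for a URL that returned a 404 (step 1)."""
    # Step (2): query search engine caches and the IA with the requested URL
    # and offer older copies to the user if any exist.
    copies = fetch_copies(url)
    if copies and is_satisfied(copies):
        return copies

    # Steps (3)-(4): derive queries (titles, tags, LSs) from the copies and
    # run them against live search engines.
    if copies:
        results = [hit for q in queries_from_copies(copies) for hit in search(q)]
        if results and is_satisfied(results):
            return results

    # Steps (5)-(6): fall back to queries built from the link neighborhood.
    # This branch is also taken when no copies were found at all.
    return [hit for q in queries_from_neighborhood(url) for hit in search(q)]
```

Passing the individual steps in as callables keeps the sketch independent of any particular search engine or archive interface.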

Recent research has shown that lexical signatures (LSs)
generated from the textual content of web pages are suitable
as search engine queries to rediscover missing pages [22, 13].
LSs are rather expensive to generate; the web pages' titles,
however, are available at a lower cost. We investigated the
change of web page content compressed into LSs over time
in [13] and focus here on the issue of title changes over time.
Our intuition is that if the frequency of change is high, titles
may not be very useful after all for rediscovering a missing
web page. In this paper we present the preliminary results
of a study investigating the frequency and degree of change
of web pages' titles over time. We predict a lower degree of
change compared to LSs since LSs are based on the content
of the entire page, which supposedly changes more frequently
than the general topic captured by the page title. The
Appendix shows three examples of web pages, their titles as
observed over time by the IA and our computed similarity
scores.

2. RELATED WORK 

2.1 Missing Web Pages 

Missing web pages are a pervasive part of the web
experience. The lack of link integrity on the web has been
addressed by numerous researchers [5, 6, 1, 2]. In 1997
Brewster Kahle published an article focused on preservation
of Internet resources claiming that the expected lifetime of
a web page is 44 days [12]. A different study of web page
availability performed by Koehler [12] shows that the random
test collection of URLs eventually reached a "steady state"
after approximately 67% of the URLs were lost over a 4-year
period. Koehler estimated that the half-life of a random web
page is approximately two years. Lawrence et al. [15] found
in 2000 that between 23 and 53% of all URLs occurring in
computer science related papers authored between 1994 and
1999 were invalid. By conducting a partially manual search
on the Internet, they were able to reduce the number of
inaccessible URLs to 3%. This confirms our intuition that
information is rarely lost, it is just moved. This intuition
is also supported by Baeza-Yates et al. [3], who show that a
significant portion of the web is created based on already
existing content.

Spinellis [24] conducted a study investigating the
accessibility of URLs occurring in papers published in
Communications of the ACM and IEEE Computer Society. He found
that 28% of all URLs were unavailable after five years and
41% after seven years. He also found that in 60% of the cases
where URLs were not accessible, a 404 error was returned.
He estimated the half-life of a URL in such a paper to be
four years from the publication date. Dellavalle et al. [7]
examined Internet references in articles published in
journals with a high impact factor (IF) given by the
Institute for Scientific Information (ISI). They found that
Internet references occur frequently (in 30% of all articles)
and are often inaccessible within months after publication
in the highest impact (top 1%) scientific and medical
journals. They discovered that the percentage of inactive
references (references that return an error message)
increased over time from 3.8% after 3 months to 10% after 15
months and up to 13% after 27 months. The majority of
inactive references they found were in the .com domain (46%)
and the fewest in the .org domain (5%). By manually browsing
the IA they were able to recover information for about 50%
of all inactive references.

2.2 Search Engine Queries 

The work done by Henzinger et al. [9] is related in the
sense that they tried to determine the "aboutness" of news
documents. They provide the user with web pages related to
TV news broadcasts using a 2-term summary which can be
thought of as an LS. This summary is extracted from closed
captions of the broadcast and various algorithms are used to
compute the scores determining the most relevant terms. The
terms are used to query a news search engine, and the
results must contain all of the query terms. The authors
found that 1-term queries return results that are too vague
and 3-term queries too often return zero results. Thus they
focus on creating 2-term queries.

Table 1: URL Character Statistics

He and Ounis' work on query performance prediction [8]
is based on the TREC dataset. They measured retrieval
performance of queries in terms of average precision (AP)
and found that the AP values depend heavily on the type of
the query. They further found that what they call simplified
clarity score (SCS) has the strongest correlation with AP
for title queries (using the title of the TREC topics). SCS
depends on the actual query length but also on global
knowledge about the corpus such as document frequency and
total number of tokens in the corpus.

2.3 The Web Infrastructure for the 
Preservation of Web Pages 

Nelson et al. [21] present various models for the
preservation of web pages based on the web infrastructure.
They argue that conventional approaches to digital
preservation such as storing digital data in archives and
applying methods of refreshing and migration are, due to the
implied costs, unsuitable for web scale preservation.

McCown has done extensive research on the usability of
the web infrastructure for reconstructing missing websites
[17]. He also developed Warrick [19], a system that crawls
web repositories such as search engine caches (characterized
in [18]) and the index of the IA to reconstruct websites. His
system is targeted at individuals and small scale
communities that are not involved in large scale
preservation projects and suffer the loss of websites.

3. EXPERIMENTAL SETUP 
3.1 Data Gathering 

The main objective of this experiment is to investigate
the (degree of) change of web pages' titles over time. It is
clearly infeasible to download all pages from the web on a
regular basis over time and analyze their changes. On the
other hand, it has been shown that finding a small set of web
pages that is representative of the entire web is not trivial
[25]. We chose to randomly sample 6,000 URLs from the Open
Directory Project at dmoz.org. There is an implicit bias in
this selection, but it appears more suitable than attempting
to get an unbiased sample and therefore, for the sake of
simplicity, it shall be sufficient.

Figure 3: Sliding and Rooted Comparison Methods

We crawled the 6,000 pages and randomly extracted from
each of the pages up to three URLs which reference locations
within the same top level domain. The resulting set
theoretically contains 18,000 URLs. In practice this number
is lower since a number of pages did not contain any links
or were simply inaccessible to the crawler at the time of
the crawl in February of 2009. Similar to the filters
applied in [22] (also with the implicit bias towards English
language web pages) we dismissed URLs that were not from
the .com, .net, .org or .edu domains. In order to
investigate the temporal change of web page titles we
checked the availability of all remaining URLs in the IA and
found copies for a total of 1090 URLs. We call one
particular copy of a web page in the IA, identified by a
time stamp, an observation. We downloaded a total of more
than 100,000 observations for our 1090 URLs. Table 1
summarizes the characteristics of all 1090 URLs that have
observations in the IA. The length of a URL is the number of
tokens the path to the referenced object contains. For
example the URLs foo.bar/ and foo.bar/index.html have a
length of one, and foo.bar/bar/ as well as
foo.bar/bar/index.html have a length of two. URLs from the
.com domain (70.2%) as well as URLs of length one (45.7%)
and two (35%) are dominant in our sample set.
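
The domain filter and the URL length defined above can be illustrated with a short Python sketch. It is one possible reading of the definition rather than the code used in the study: the function names are hypothetical, and we assume that the host counts as the first token, that each directory adds one token, and that a final path segment without a trailing slash is a file name.

```python
from urllib.parse import urlsplit

ALLOWED_TLDS = {"com", "net", "org", "edu"}


def _split(url: str):
    # urlsplit only recognizes the host if a scheme is present.
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url)


def has_allowed_tld(url: str) -> bool:
    """True if the host ends in .com, .net, .org or .edu."""
    host = _split(url).hostname or ""
    return host.rsplit(".", 1)[-1] in ALLOWED_TLDS


def url_length(url: str) -> int:
    """Number of tokens on the path to the referenced object.

    Assumption: the host is one token and every directory adds one;
    the last path segment (file name or empty) is not counted.
    """
    segments = _split(url).path.split("/")
    directories = [s for s in segments[:-1] if s]
    return 1 + len(directories)
```

With these assumptions the four example URLs from the text yield lengths of one, one, two and two.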

3.2 Measures of Change 

With the corpus created we analyze the change of web 
page titles over time with two different measures. Since 
we anticipate a low degree of change we first investigate 
the general frequency of change meaning how often a title 
is modified over the time span covered by all available IA 
observations. 

The second measure is meant to represent the degree of
change of the titles over time. We use the Levenshtein score,
which is based on the minimum number of operations needed
to transform one title into another, and compute it for all
titles of our corpus. A low Levenshtein score means the
compared titles are very dissimilar and a high score indicates
a high level of similarity. The score is different from what
is known as the Levenshtein distance, where the value of 1.0
means totally dissimilar strings and 0.0 indicates a match.
We compute the score in two different ways: the sliding and
the rooted comparison. To explain the two methods, let us
consider a URL with five observations O_1 ... O_5. The sliding
comparison computes the Levenshtein score between O_1 and
O_2, O_2 and O_3, O_3 and O_4, and O_4 and O_5. It continuously
slides the comparison window forward by one observation,
hence the name. The rooted method (for the same example)
computes the score between O_1 and O_2, O_1 and O_3, O_1
and O_4, and O_1 and O_5; hence we call it a rooted comparison.
This example is visually represented in Figure 3.

We used the SimMetrics library to compute the Levenshtein
scores.
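
To make the two comparison methods concrete, the following Python sketch computes sliding and rooted scores for a list of titles. It does not use SimMetrics; as an assumption, the score is normalized as 1 - distance / max(len(a), len(b)), which matches the stated range (1.0 for identical titles, 0.0 for completely dissimilar ones) but may differ in detail from the library. All function names are illustrative.

```python
from typing import List, Sequence


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_score(a: str, b: str) -> float:
    """Assumed normalization: 1.0 = identical, 0.0 = completely dissimilar."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def sliding_scores(titles: Sequence[str]) -> List[float]:
    """Compare each observation with its direct predecessor."""
    return [levenshtein_score(titles[i], titles[i + 1])
            for i in range(len(titles) - 1)]


def rooted_scores(titles: Sequence[str]) -> List[float]:
    """Compare every later observation with the first one."""
    return [levenshtein_score(titles[0], t) for t in titles[1:]]
```

For a URL with five observations both functions return four scores; the per-URL value reported in the results would then be the mean of such a list.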
Figure 4: Number of Title Changes and Observations in the
Internet Archive of all URLs



4. EXPERIMENTAL RESULTS 
4.1 The Number of Changes 

Each time a title changes, that is, the captured title of
observation O_{n+1} differs from the title of the earlier
observation O_n, the frequency of change is increased by
one. Figure 4 shows in semi-log scale the number of title
changes and the total number of IA observations (y-axis) of
all 1090 URLs (x-axis). The URLs are sorted in increasing
order by number of observations first and number of title
changes second. We generally observe a rather low frequency
of change. The most "inconsistent" URL accounts for 25 title
changes. This result confirms the intuition that titles are
more stable than, for example, LSs of web pages. The number
of observations of URLs in the IA over time does not impact
the number of changes of their titles. For example, we see
URLs with thousands of observations having similarly few
title changes as URLs with fewer than 50 observations. This
means that the frequency of title changes in our sample set
is not biased towards the number of available observations
in the IA.
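
The counting rule above can be expressed in a few lines of Python. This is a minimal sketch, assuming the titles of one URL are given as a list ordered by observation time; the function name is hypothetical.

```python
from typing import Sequence


def count_title_changes(titles: Sequence[str]) -> int:
    """Count how often a title differs from the title observed just before it."""
    return sum(1 for earlier, later in zip(titles, titles[1:])
               if earlier != later)
```

Note that a title which changes and later reverts to an earlier value counts as two changes under this rule.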

Figure 5 is also plotted in semi-log scale. It displays the
mean time that has passed between all available IA
observations as well as the amount of time passed between
the first and the last observation. Both values are measured
in days and indicated on the y-axis. The ordering of the URLs
in this graph is the same as in Figure 4. We can see that
with the increasing number of IA observations per URL the
time gap between observations decreases. The overall time
span passed between the first and the last observation starts
off high and slightly increases with the rising number of IA
observations. This result indicates that URLs with many
observations in the IA have been crawled frequently in the
past in a rather short period of time and most likely are
still being crawled with that frequency. It further points to
an early start of the crawl for such URLs since the overall
time span of all observations is high. Since the web is
growing and the IA claims to constantly increase the number
of pages crawled [20], this observation matches our intuition.



Figure 5: Mean Time Delta Between all Observations in the
Internet Archive and Entire Time Span of Observations (in
Days) of all URLs



However, we are not in a position to say whether just the
frequency of crawls for already indexed pages increased or
the actual size of the index has increased, meaning new pages
have been discovered, crawled and indexed. For URLs with
10 or fewer observations the difference between the two time
values is hardly noticeable.
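
The two time values plotted in Figure 5 can be derived from the observation time stamps as sketched below. The sketch assumes that time stamps are 14-digit strings of the form YYYYMMDDhhmmss; the function names are hypothetical and the code is not taken from the study.

```python
from datetime import datetime
from typing import Sequence, Tuple


def parse_timestamp(ts: str) -> datetime:
    # Assumed time stamp format: YYYYMMDDhhmmss
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def observation_time_stats(timestamps: Sequence[str]) -> Tuple[float, float]:
    """Return (mean gap between consecutive observations, total span), in days."""
    if len(timestamps) < 2:
        return 0.0, 0.0
    times = sorted(parse_timestamp(ts) for ts in timestamps)
    span_days = (times[-1] - times[0]).total_seconds() / 86400
    # The gaps between consecutive observations sum to the total span,
    # so their mean is the span divided by the number of gaps.
    mean_gap_days = span_days / (len(times) - 1)
    return mean_gap_days, span_days
```

Because the mean gap equals the span divided by the number of gaps, the two values lie close together for URLs with few observations and diverge as the number of observations grows.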

4.2 Degree of Change 

As mentioned above, the degree of change is measured
using the Levenshtein score. The score varies between zero
and one, where one means the titles are identical and zero
means they are completely dissimilar. The mean sliding scores
over all observations per URL are shown in Figure 6. These
scores are generally very high. Only five out of our 1090
URLs have a score of zero and more than 85% of all URLs show
a score of 0.8 or above. That means that titles generally do
not change drastically between a pair of observations.
Slight changes are much more likely. The mean rooted values
are plotted in Figure 7 and they are, as we anticipated,
lower. Even though only nine URLs have a score of zero, just
about 56% of the URLs have a score equal to or above 0.8.
This result confirms that a lot of the titles in our sample
set do change compared to their first available IA
observation but still not as dramatically as, for example,
the pages' content (see [13]).
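
The share of URLs at or above a given score can be computed as in the following Python sketch. The function name is hypothetical; the example values are the mean scores of the three URLs listed in the Appendix and are far too few to reproduce the percentages reported above.

```python
from typing import Mapping


def share_at_or_above(mean_scores: Mapping[str, float],
                      threshold: float = 0.8) -> float:
    """Fraction of URLs whose mean Levenshtein score reaches the threshold."""
    if not mean_scores:
        return 0.0
    hits = sum(1 for score in mean_scores.values() if score >= threshold)
    return hits / len(mean_scores)


# Mean scores of the three Appendix examples.
sliding = {"www.originalbristol.com": 0.81,
           "www.sun.com/solutions": 0.84,
           "www.datacity.com/mainf.html": 0.68}
rooted = {"www.originalbristol.com": 0.47,
          "www.sun.com/solutions": 0.29,
          "www.datacity.com/mainf.html": 0.15}
```

For these three URLs, two reach the 0.8 threshold under the sliding comparison and none under the rooted comparison, which mirrors the general tendency of rooted scores to be lower.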

5. CONCLUSIONS AND FUTURE WORK 

First results have confirmed our intuition that web pages'
titles are a good resource for rediscovering missing pages.
We therefore further investigate the potential of such titles
by analyzing their frequency and degree of change over time.
We randomly sampled URLs from dmoz.org and analyzed
their titles from copies available through the IA. We found a
low frequency of change. For example, of all URLs with
at least 10 observations (68%), almost 89% show changes in
their title only five times or fewer. We analyzed the degree
of change by computing Levenshtein scores for a sliding and
rooted comparison of all observations per URL. The scores
are very high for the sliding measure. More than five out of
six URLs have a score of 0.8 or above, which means the title
changes, if at all, just slightly. The rooted scores are
lower but still more than one half of the URLs have a score
of 0.8 or above.

Figure 6: Mean Levenshtein Score of all Titles - Sliding
Comparison

We consider this work preliminary and see several aspects
for future work. Most importantly, we will apply various
natural language processing techniques to create a "quality
prediction" model for web pages' titles. The goal is to
predict how promising any given title is for our purpose in
order to decide whether to use the title or rather generate
an LS of the page. Since we are using these titles to query
search engines (and each query comes with an associated
cost) in order to rediscover missing web pages, we will then
be able to automatically dismiss low value titles such as
Index and Home Page. Another interesting aspect for future
work is investigating the change of titles for dynamic
compared to static URLs. We can, for example, identify URLs
that are passing parameters with a & as dynamic. However,
the difficult part is to determine when URIs that do not
contain such parameters resolve to dynamic content.
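
A simple heuristic for the first, easy case could look like the following Python sketch. It is an assumption on our part that a URL with a non-empty query string (the part after "?") should also be treated as dynamic, in addition to URLs containing a &; the function name is hypothetical. As noted above, such a check cannot detect dynamic content behind URIs without parameters.

```python
from urllib.parse import urlsplit


def looks_dynamic(url: str) -> bool:
    """Heuristic: the URL passes parameters (contains '&' or a query string)."""
    return "&" in url or bool(urlsplit(url).query)
```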
Appendices

www.originalbristol.com

mean Levenshtein score sliding: 0.81 rooted: 0.47

2007-03-31
Original 106.5 - Bristol

2007-04-17
Original Bristol 106.5 fm Weblog

2007-04-22
Original Bristol 106.5 fm prelaunch blog

2007-05-18
Original Bristol

2007-08-16
Original 106.5 fm - The new radio station for Bristol - Original like you

2007-12-01
Original 106.5 - Bristol's Best Music!

2008-01-04
Original 106.5

www.sun.com/solutions

mean Levenshtein score sliding: 0.84 rooted: 0.29

1998-01-27
Sun Software Products Selector Guides - Solutions Tree

1999-02-20
Sun Software Solutions

2002-02-01
Sun Microsystems Products

2002-06-01
Sun Microsystems - Business & Industry Solutions

2003-08-01
Sun Microsystems - Industry & Infrastructure Solutions

2004-02-02
Sun Microsystems - Solutions

2004-06-10
Gateway Page - Sun Solutions

2006-01-09
Sun Microsystems Solutions & Services

2007-01-03
Services & Solutions

2007-02-07
Sun Services & Solutions

2008-01-19
Sun Solutions

www.datacity.com/mainf.html

mean Levenshtein score sliding: 0.68 rooted: 0.15

2000-06-19
DataCity of Manassas Park Main Page

2000-10-12
DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives

2001-08-21
DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives

2002-10-16
computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free

2006-03-14
Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite® lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB


